S1 Priors for the groups



Our goal is to classify the observed allelic read counts at each site and in
each tissue into one of three groups. We want the groups to represent (i) no
ASE (group $\mathcal{N}$), where both alleles are (almost) equally expressed,
(ii) strong ASE (group $\mathcal{S}$), where one of the alleles is expressed
very little if at all, and (iii) moderate ASE (group $\mathcal{M}$), which
covers everything between the first two groups. In the main text we propose
the following priors for the reference allele read count frequencies of these
groups:

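$$\theta(\mathcal{N}) \sim \mathrm{Beta}(2000, 2000),$$
$$\theta(\mathcal{M}) \sim \tfrac{1}{2}\,\mathrm{Beta}(36, 12) + \tfrac{1}{2}\,\mathrm{Beta}(12, 36),$$
$$\theta(\mathcal{S}) \sim \tfrac{1}{2}\,\mathrm{Beta}(80, 1) + \tfrac{1}{2}\,\mathrm{Beta}(1, 80).$$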

Figure S1 shows the densities of these priors together with the regions of the
read count frequency space where each group dominates the other two by at
least a factor of 10. We see that our choices of prior parameters satisfy our
goal since:

• (i) group $\mathcal{N}$ dominates in the small region (0.47, 0.53) around 0.5,

• (ii) group $\mathcal{S}$ dominates at extreme frequencies of < 0.07 and > 0.93,

• (iii) group $\mathcal{M}$ dominates at nearly all the remaining frequencies:
(0.10, 0.46) and (0.54, 0.90).
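
The dominance regions listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`beta_pdf`, `prior_density`, `dominant_group`) are illustrative and are not part of the authors' implementation.

```python
from math import exp, lgamma, log

# Mixture priors from this section: (weight, alpha, beta) per component.
PRIORS = {
    "N": [(1.0, 2000.0, 2000.0)],
    "M": [(0.5, 36.0, 12.0), (0.5, 12.0, 36.0)],
    "S": [(0.5, 80.0, 1.0), (0.5, 1.0, 80.0)],
}

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in the open interval (0, 1)."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x))

def prior_density(group, x):
    """Density of the group's mixture prior at frequency x."""
    return sum(w * beta_pdf(x, a, b) for w, a, b in PRIORS[group])

def dominant_group(x, factor=10.0):
    """Return the group whose density exceeds both others by `factor`, else None."""
    dens = {g: prior_density(g, x) for g in PRIORS}
    for g, p in dens.items():
        if all(p >= factor * q for other, q in dens.items() if other != g):
            return g
    return None

for x in (0.02, 0.30, 0.50, 0.70, 0.98):
    print(x, dominant_group(x))
```

Scanning `dominant_group` over a fine grid of frequencies should recover interval boundaries close to those quoted in the list, up to rounding.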

Truncated prior. Our implementation allows truncating each Beta
distribution to a user-specified interval in order to make the supports of the
different groups non-overlapping. This is especially useful when one-sided
priors are used. For example, if we are studying nonsense-mediated decay
and want ASE to be called only if the reference allele shows a read count
frequency over 0.5, we could use the following one-sided truncated priors:

$$\theta(\mathcal{N}) \sim \mathrm{Beta}(2000, 2000)\, I_{[0,\,0.52)},$$
$$\theta(\mathcal{M}) \sim \mathrm{Beta}(36, 12)\, I_{[0.52,\,0.95)},$$
$$\theta(\mathcal{S}) \sim \mathrm{Beta}(80, 1)\, I_{[0.95,\,1.0]},$$

where $I_{[a,b)}$ denotes truncation of the distribution to the interval $[a, b)$.
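
For concreteness, truncating a Beta density to $[a, b)$ amounts to restricting it to that interval and renormalising. As a sketch, write $\alpha, \beta$ for the shape parameters and $F_{\alpha,\beta}$ for the Beta cumulative distribution function (both symbols are introduced here for illustration only). The truncated density is then:

```latex
p(\theta \mid \alpha, \beta, a, b)
  = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}
         {B(\alpha,\beta)\,\bigl[F_{\alpha,\beta}(b) - F_{\alpha,\beta}(a)\bigr]}
    \;\mathbf{1}\{a \le \theta < b\}
```
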

Independent tissues. Our implementation allows relaxing the assumption
that all tissues in one group have exactly the same reference allele read
count frequency. This is done by modeling each tissue-specific $\theta_s$ as an
independent draw from the corresponding prior for the group. This is useful
when we have informative data with a large number of reads for each tissue
and the tissues within one group do not have exactly the same value of $\theta$.
On the other hand, with a small number of reads per tissue the basic GTM
(without the independence assumption) is our default choice because it allows
borrowing strength across the tissues in the same group.
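
The difference between the two variants can be written in terms of the marginal likelihood of the tissues assigned to one group $G$. The following is a sketch under notation assumed here for illustration: $A_G$ is the set of tissues assigned to $G$, $p_G$ is the group's prior density, and $y_{s1}, y_{s2}$ are the reference and non-reference counts in tissue $s$ (binomial coefficients omitted).

```latex
% Basic GTM: one frequency shared by all tissues in the group
f_{\mathrm{shared}}
  = \int_0^1 \Bigl[\prod_{s \in A_G} \theta^{y_{s1}} (1-\theta)^{y_{s2}}\Bigr]
    p_G(\theta)\, d\theta

% Independent tissues: a separate draw from the same prior for every tissue
f_{\mathrm{indep}}
  = \prod_{s \in A_G} \int_0^1 \theta_s^{y_{s1}} (1-\theta_s)^{y_{s2}}\,
    p_G(\theta_s)\, d\theta_s
```

In the shared form the counts of all tissues in the group are pooled before integrating, which is what lets tissues with few reads borrow strength from each other.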

S2 Gibbs sampler for GTM 

We use a Gibbs sampler to explore the posterior distribution of the
configuration $\gamma \in \{\mathcal{N}, \mathcal{M}, \mathcal{S}\}^T$, where $T$
is the number of tissues. We denote by $\pi_H$ the (fixed) prior probability of
the heterogeneity states. (In the main text we use $\pi_H = 0.25$.) As in the
main text, we denote by $y$ the observed read count data at one site across
all tissues.

We fix the number of iterations $n_{\mathrm{iter}} = 2{,}000$ and the number of
burn-in iterations $n_{\mathrm{burn}} = 10$, and run the following Gibbs sampler.

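1. Initialize $\gamma^{(0)} = (\gamma_1^{(0)}, \ldots, \gamma_T^{(0)})$ with a
random configuration.

2. Repeat for $t = 1, 2, \ldots, (n_{\mathrm{burn}} + n_{\mathrm{iter}})$:
For $s = 1, 2, \ldots, T$:

• Compute the probability vector

$$\bigl(p_s^{(t)}(\mathcal{N}),\; p_s^{(t)}(\mathcal{M}),\; p_s^{(t)}(\mathcal{S})\bigr),$$

where for each group $G \in \{\mathcal{N}, \mathcal{M}, \mathcal{S}\}$,

$$p_s^{(t)}(G) \propto f\bigl(y; \tilde{\gamma}(G)\bigr)\, \pi\bigl(\tilde{\gamma}(G)\bigr).$$

Here $f(y; \gamma)$ is the beta-binomial marginal likelihood for the data
given the group indicators $\gamma$ and the prior distributions for the $\theta$
parameters of each group; $\pi(\gamma)$ is the prior probability of the
configuration $\gamma$, which is determined by $\pi_H$ together with the distance
$d(\gamma)$; and

$$\tilde{\gamma}(G) = \bigl(\gamma_1^{(t)}, \ldots, \gamma_{s-1}^{(t)},\, G,\, \gamma_{s+1}^{(t-1)}, \ldots, \gamma_T^{(t-1)}\bigr).$$

• Generate

$$\gamma_s^{(t)} =
\begin{cases}
\mathcal{N}, & \text{with probability } p_s^{(t)}(\mathcal{N}) \\
\mathcal{M}, & \text{with probability } p_s^{(t)}(\mathcal{M}) \\
\mathcal{S}, & \text{with probability } p_s^{(t)}(\mathcal{S}).
\end{cases}$$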
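
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The read counts are taken to be a list of `(reference, non_reference)` pairs, binomial coefficients are dropped because they cancel across configurations, and the configuration prior is passed in as a hypothetical callable `log_prior`, since its exact form is defined in the main text. The usage example plugs in a flat prior purely as a placeholder.

```python
import math
import random
from collections import Counter

GROUPS = ("N", "M", "S")
# Group priors as (weight, alpha, beta) mixture components.
PRIORS = {
    "N": [(1.0, 2000.0, 2000.0)],
    "M": [(0.5, 36.0, 12.0), (0.5, 12.0, 36.0)],
    "S": [(0.5, 80.0, 1.0), (0.5, 1.0, 80.0)],
}

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(y, gamma):
    """log f(y; gamma) up to an additive constant; tissues in a group share theta."""
    total = 0.0
    for g in GROUPS:
        ref = sum(r for (r, _), c in zip(y, gamma) if c == g)
        alt = sum(n for (_, n), c in zip(y, gamma) if c == g)
        terms = [math.log(w) + log_beta(a + ref, b + alt) - log_beta(a, b)
                 for w, a, b in PRIORS[g]]
        top = max(terms)  # log-sum-exp over the mixture components
        total += top + math.log(sum(math.exp(v - top) for v in terms))
    return total

def gibbs(y, log_prior, n_iter=2000, n_burn=10, seed=0):
    """Return n_iter sampled configurations after n_burn burn-in sweeps."""
    rng = random.Random(seed)
    gamma = [rng.choice(GROUPS) for _ in y]  # step 1: random start
    samples = []
    for t in range(1, n_burn + n_iter + 1):  # step 2: sweeps over tissues
        for s in range(len(y)):
            logp = []
            for g in GROUPS:
                cand = gamma[:s] + [g] + gamma[s + 1:]
                logp.append(log_marginal(y, cand) + log_prior(cand))
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            gamma[s] = rng.choices(GROUPS, weights=weights)[0]
        if t > n_burn:
            samples.append(tuple(gamma))
    return samples

# Placeholder prior (assumption): flat over configurations, NOT the GTM prior.
flat_prior = lambda gamma: 0.0
y = [(50, 50), (48, 52), (99, 1)]
samples = gibbs(y, flat_prior)
mode = Counter(samples).most_common(1)[0][0]
print(mode)
```

Because each update conditions on the current values of all other tissues, replacing `flat_prior` with a function of the distance $d(\gamma)$ changes only the `log_prior(cand)` term.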



S3 Hierarchical model (GTM*) 

We extend the grouped tissue model (GTM) defined in the main text to the 
case where many variants with similar properties (such as protein truncating 
variants) are analyzed simultaneously. We add one level of hierarchy to
the model by introducing a vector $\pi = (\pi_N, \pi_M, \pi_S, \pi_{H0}, \pi_{H1})$
that determines
the proportion of variants in each of the five states defined in the main 
text (N=NOASE, M=MODASE, S=SNGASE, H0=HET0 and H1=HET1). 

Denote by $y^{(\ell)} = \bigl((y^{(\ell)}_{11}, y^{(\ell)}_{12}), \ldots, (y^{(\ell)}_{T_\ell 1}, y^{(\ell)}_{T_\ell 2})\bigr)$
the reference (1) and non-reference (2) allele counts for variant $\ell$ over
the available tissue types, and by $y = (y^{(\ell)})_{\ell=1}^{L}$ the data over
all $L$ variants.

This extension, called GTM*, is the following model over variants
$\ell = 1, \ldots, L$ and tissues $s = 1, \ldots, T_\ell$:

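
$$
\begin{aligned}
\theta^{(\ell)}(\mathcal{N}) &\sim \mathrm{Beta}(2000, 2000) \\
\theta^{(\ell)}(\mathcal{M}) &\sim \tfrac{1}{2}\,\mathrm{Beta}(36, 12) + \tfrac{1}{2}\,\mathrm{Beta}(12, 36) \\
\theta^{(\ell)}(\mathcal{S}) &\sim \tfrac{1}{2}\,\mathrm{Beta}(80, 1) + \tfrac{1}{2}\,\mathrm{Beta}(1, 80) \\
\pi &\sim \mathrm{Dirichlet}(1, 1, 1, 1, 1) \\
\gamma \mid \pi &\;
\begin{cases}
\gamma = \text{NOASE}, & \text{with probability } \pi_N \\
\gamma = \text{MODASE}, & \text{with probability } \pi_M \\
\gamma = \text{SNGASE}, & \text{with probability } \pi_S \\
\gamma \in \text{HET0}, & \text{with probability } \dfrac{\pi_{H0}}{(T_\ell - \lceil T_\ell/3 \rceil)\, h_0(d(\gamma))} \\
\gamma \in \text{HET1}, & \text{with probability } \dfrac{\pi_{H1}}{\lfloor T_\ell/2 \rfloor\, h_1(d(\gamma))}
\end{cases}
\end{aligned}
$$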



where $d(\gamma)$ is the distance of configuration $\gamma$ from homogeneity (see
the main text) and $h_0(d)$ is the number of configurations belonging to state
HET0 and having distance $d$ from homogeneity (similarly $h_1(d)$ for HET1
configurations). The values $(T_\ell - \lceil T_\ell/3 \rceil)$ and
$\lfloor T_\ell/2 \rfloor$ are the maximum distances among all configurations in
HET0 and HET1, respectively. In other words, we directly model the probability
of the three homogeneous states by $\pi_N$, $\pi_M$ and $\pi_S$, and we
distribute the probabilities $\pi_{H0}$ and $\pi_{H1}$ within each heterogeneous
state uniformly with respect to the distance, and also uniformly among the
configurations with the same distance. This model is slightly different from
our original GTM as the probabilities $\pi_{H0}$ for HET0 and $\pi_{H1}$ for HET1
states have been separated from each other. In settings where we want to
follow the exact prior structure of GTM, our implementation also makes it
possible to run GTM* parameterized with a single heterogeneity probability
$\pi_H = \pi_{H0} + \pi_{H1}$. This mode can be invoked by simply specifying
the Dirichlet prior for $\pi$ with four parameters instead of five.

We have implemented GTM* through a Gibbs sampler, which follows the
algorithm given above for GTM with an additional Gibbs update for $\pi$ with

$$\pi \sim \mathrm{Dirichlet}(n_{\mathrm{NOASE}} + 1,\; n_{\mathrm{MODASE}} + 1,\; n_{\mathrm{SNGASE}} + 1,\; n_{\mathrm{HET0}} + 1,\; n_{\mathrm{HET1}} + 1),$$

where each $n_S$ denotes the number of variants currently assigned to state $S$.
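
The conjugate update for $\pi$ can be sketched as follows. This is a minimal illustration with the Python standard library, in which a Dirichlet draw is obtained by normalising independent Gamma variates. The function name `sample_pi` and the example state counts are hypothetical.

```python
import random
from collections import Counter

STATES = ("NOASE", "MODASE", "SNGASE", "HET0", "HET1")

def sample_pi(assignments, rng, prior=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Draw pi ~ Dirichlet(n_state + prior) given the current state of each variant."""
    counts = Counter(assignments)
    draws = [rng.gammavariate(counts[s] + a, 1.0) for s, a in zip(STATES, prior)]
    total = sum(draws)
    return {s: g / total for s, g in zip(STATES, draws)}

rng = random.Random(1)
# Hypothetical current assignments of 200 variants.
assignments = ["NOASE"] * 20 + ["MODASE"] * 60 + ["HET0"] * 80 + ["HET1"] * 40
pi = sample_pi(assignments, rng)
print(pi)
```

In a full sampler this draw would alternate with the per-variant configuration updates, with the current $\pi$ entering each variant's configuration prior.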

An advantage of GTM* over variant-specific analyses using GTM is that
the posterior distribution of $\pi$ is available. We expect that the posterior
of $\pi$ from GTM* is more accurate than averaging the variant-specific
posteriors from GTM and, importantly, that it properly accounts for
uncertainty in these estimates. However, when read counts are not very small
(say we have 30 or more reads per tissue per variant), we expect the two
approaches to give fairly similar estimates. We next give some comparisons
between the GTM* and GTM approaches to inference about $\pi$.

S4 Comparing GTM and GTM* 

First we analysed the simulated data of the main text with GTM* (1,000
data sets for each combination of T = 5, 10, 30 tissues, n = 10, 50 reads
and the nine scenarios). We present the posterior expectation of $\pi$ from
GTM* in Figure S2, together with the original GTM results from the main text,
which average the individual state posteriors across the 1,000 data sets.

The results show that with 50 reads GTM* correctly infers the true state
even in scenarios which were not completely solved by GTM. With 10 reads,
too, GTM* improves the proportion estimate compared to GTM in most
cases. A notable exception is scenario 5, which according to GTM* is
almost completely in the HET1 state whereas the data sets were simulated from
a HET0 state. This happens because the prior probability of the HET1 state
has been separated from that of HET0 in GTM* and thus, under GTM*,
any one tissue-specific configuration in the HET1 state has a higher prior
probability than a tissue-specific configuration in the HET0 state (as there
are fewer configurations in HET1 than in HET0). Thus, if the data carry little
information to distinguish between a configuration in HET0 and another one
in HET1, then GTM* tends to prefer the HET1 state. On the other hand,
GTM gives the same prior probability to every tissue-specific configuration,
whether it belongs to the HET0 or the HET1 state. When the latter property of
the model is considered more appropriate, one can run our GTM* implementation
parameterized with the combined heterogeneity probability
$\pi_H = \pi_{H0} + \pi_{H1}$ by simply specifying the Dirichlet prior for $\pi$
with four parameters instead of five. More importantly, when the amount of
information increases, the small differences between the two prior
specifications become insignificant, as shown by the results with 50 reads in
Figure S2.

The above comparison shows how much the GTM* estimate of $\pi$ differs
from that of GTM in an extreme case where all the variants analysed belong to
the same underlying state. More realistically, variants would represent
different states, and in that case we expect the difference between GTM* and
GTM to decrease. To compare the approaches in such a setting, we randomly
subsampled from our simulated data sets for T = 10 tissues, for both 10 and
50 reads per tissue, 50 collections of 200 variants with the following
proportions of states: 10% NOASE (from scenario 1), 30% MODASE (from
scenario 2), 40% HET0 (from scenario 8) and 20% HET1 (from scenario 9).
The 50 point estimates of the proportions by GTM* and GTM together with the
true values are shown in Figure S3.

For 10 reads per tissue, both GTM* and GTM underestimate the proportion
of heterogeneous variants and overestimate that of homogeneous ones. This
is in line with the principle that with insufficient information we prefer
homogeneous states. GTM* is notably more accurate than GTM for the MODASE
and HET1 states while the opposite is true for the NOASE and HET0 states.

For 50 reads per tissue, both approaches give accurate estimates for prac- 
tical purposes, but GTM* is more accurate than GTM. 

We conclude that when many variants are available and we are interested
in the state proportions $\pi$, we should apply GTM* to estimate $\pi$ together
with its uncertainty. However, GTM is both an essential building block for
GTM* and an important model on its own, since it is quick to run, easy to
understand and requires data on only a single variant. For these reasons, we
have devoted the main text of this work to GTM.



S5 Combinatorics of configurations 

Consider $T$ tissues and a configuration $\gamma = (\gamma_1, \ldots, \gamma_T)$
where each $\gamma_s \in \{\mathcal{N}, \mathcal{M}, \mathcal{S}\}$. Altogether
there are $3^T$ configurations, of which 3 are homogeneous and $3^T - 3$ are
heterogeneous. The total number of HET1 configurations is $2^T - 2$ and hence
the number of HET0 configurations is $3^T - 2^T - 1$.
Consider the configurations at distance $d$ from homogeneity, where

$$d = T - \max\{t_N, t_M, t_S\} \quad \text{with} \quad t_G = \#\{s : \gamma_s = G\}$$

being the number of tissues in group $G \in \{\mathcal{N}, \mathcal{M}, \mathcal{S}\}$.
Denote the three counts $(t_N, t_M, t_S)$ in ascending order by
$i \le d - i \le T - d$, whence

$$\max\{0, 2d - T\} \le i \le \lfloor d/2 \rfloor.$$

The number of heterogeneous configurations at distance $d$ is

$$h(d) = \sum_{i=\max\{0,\,2d-T\}}^{\lfloor d/2 \rfloor}
\binom{T}{i,\; d-i,\; T-d}\,
\frac{3!}{\bigl(4 - \#\{i,\, d-i,\, T-d\}\bigr)!},$$


where the first term in the sum is the multinomial coefficient telling how 
many ways there are to split T tissues among the given group counts, and 
the second term multiplies by 6, 3 or 1 according to whether all three counts 
are different, exactly two of the counts are equal to each other or all of the 
counts are equal. 

The number of HET1 configurations at distance $d = 1, \ldots, \lfloor T/2 \rfloor$ is

$$h_1(d) = \binom{T}{d}\, \frac{2!}{\bigl(3 - \#\{d,\, T-d\}\bigr)!}.$$

Using the formulae derived above, the number of HET0 configurations at
distance $d$ is $h_0(d) = h(d) - h_1(d)$.
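
These counting formulae are easy to check by brute force for small $T$. The sketch below (standard-library Python; all function names are illustrative) implements $h$, $h_1$ and $h_0$ and compares $h(d)$ against direct enumeration of all $3^T$ configurations.

```python
from itertools import product
from math import comb, factorial

def multinomial(T, counts):
    """Multinomial coefficient T! / (c1! c2! c3!)."""
    out = factorial(T)
    for c in counts:
        out //= factorial(c)
    return out

def h(d, T):
    """Heterogeneous configurations at distance d >= 1 from homogeneity."""
    total = 0
    for i in range(max(0, 2 * d - T), d // 2 + 1):
        counts = (i, d - i, T - d)
        distinct = len(set(counts))  # 3, 2 or 1 distinct group sizes
        total += multinomial(T, counts) * factorial(3) // factorial(4 - distinct)
    return total

def h1(d, T):
    """HET1 configurations at distance d (zero outside 1..floor(T/2))."""
    if not 1 <= d <= T // 2:
        return 0
    return comb(T, d) * factorial(2) // factorial(3 - len({d, T - d}))

def h0(d, T):
    """HET0 configurations at distance d."""
    return h(d, T) - h1(d, T)

def brute_force_h(d, T):
    """Count configurations at distance d by enumerating all 3**T of them."""
    return sum(1 for cfg in product("NMS", repeat=T)
               if T - max(cfg.count(g) for g in "NMS") == d)

T = 5
max_d = T - (-(-T // 3))  # T - ceil(T / 3)
print([h(d, T) for d in range(1, max_d + 1)])  # [30, 120, 90]
```

For $T = 5$ the three values sum to $240 = 3^5 - 3$, matching the total number of heterogeneous configurations stated above.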

Figure S1: The top panel shows the densities of the prior distributions of the
reference allele for the three groups: $\mathcal{N}$, $\mathcal{M}$ and
$\mathcal{S}$. The lower panel shows the regions where each of the densities
dominates the other two by a factor of at least 10, and the 95% highest
probability regions for each of the prior distributions.

Figure S2: Results of GTM* and GTM on the simulated data sets of the main
text. Each of the nine simulation scenarios (Table 1 in the main text) is represented
by three numbers of tissues (5, 10, 30) and two values for the number of reads
(10, left columns, and 50, right columns). Each bar is divided into five colors
(map given at the bottom) according to the posterior expectation of the state
probabilities, $\pi$, for GTM* and the (average) posterior probability of the
five states for GTM.

Figure S3: Fifty collections of 200 variants with 10 tissue types were analysed
and the estimates of the proportions of variants in each of the five states are
shown for GTM* (posterior expectation of $\pi$) and for GTM (average over
variant-specific state posteriors). The true proportions are shown with
horizontal lines. The analyses were done for both 10 and 50 reads per tissue
per variant.